Implicit Bias of Gradient Descent for Non-Homogeneous Deep Networks
Cai, Yuhang; Zhou, Kangjie; Wu, Jingfeng; Mei, Song; Lindsey, Michael; Bartlett, Peter L.
Deep networks often have an enormous number of parameters and are theoretically capable of overfitting the training data. However, in practice, deep networks trained via gradient descent (GD) or its variants often generalize well. This is commonly attributed to the implicit bias of GD, in which GD finds a certain solution that prevents overfitting (Zhang et al., 2021; Neyshabur et al., 2017; Bartlett et al., 2021). Understanding the implicit bias of GD is one of the central topics in deep learning theory. The implicit bias of GD is relatively well understood when the network is homogeneous (see Soudry et al., 2018; Ji and Telgarsky, 2018; Lyu and Li, 2020; Ji and Telgarsky, 2020; Wu et al., 2023, and references therein). For linear networks trained on linearly separable data, GD diverges in norm while converging in direction to the maximum margin solution (Soudry et al., 2018; Ji and Telgarsky, 2018; Wu et al., 2023). Similar results have been established for generic homogeneous networks that include a class of deep networks, assuming that the network at initialization can separate the training data.
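The linear case described above can be seen in a few lines of code. The following is a minimal sketch (not code from the paper; the data, step size, and iteration count are illustrative): gradient descent on the logistic loss over linearly separable toy data, where the weight norm diverges while the direction drifts toward the maximum-margin separator, which for this particular dataset works out to (1, -1)/sqrt(2).

```python
import numpy as np

# Toy linearly separable data (illustrative).
X = np.array([[3.0, 1.0], [1.0, -1.0], [-3.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_hat = np.array([1.0, -1.0]) / np.sqrt(2.0)  # max-margin direction for this data

w = np.zeros(2)
lr = 0.2
err_early = None
for t in range(1, 200001):
    m = y * (X @ w)                                     # per-example margins
    s = 1.0 / (1.0 + np.exp(np.clip(m, None, 60.0)))    # -dloss/dmargin, clipped for stability
    w += lr * ((X.T * y) @ s)                           # GD step on the logistic loss
    if t == 1000:
        err_early = np.linalg.norm(w / np.linalg.norm(w) - w_hat)

err_late = np.linalg.norm(w / np.linalg.norm(w) - w_hat)
# The norm diverges (slowly, roughly like log t), while the direction
# gets closer to the maximum-margin separator: err_late < err_early.
```

The normalized iterate w/||w|| keeps improving even after all training margins are positive, which is the content of the directional results cited above.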
The late-stage training dynamics of (stochastic) subgradient descent on homogeneous neural networks
Schechtman, Sholom; Schreuder, Nicolas
We analyze the implicit bias of constant-step-size stochastic subgradient descent (SGD). We consider the setting of binary classification with homogeneous neural networks, a large class of deep neural networks with ReLU-type activation functions, such as MLPs and CNNs without biases. We interpret the dynamics of normalized SGD iterates as an Euler-like discretization of a conservative field flow that is naturally associated with the normalized classification margin. Owing to this interpretation, we show that normalized SGD iterates converge to the set of critical points of the normalized margin at late-stage training (i.e., assuming that the data is correctly classified with positive normalized margin). To our knowledge, this is the first extension of the analysis of Lyu and Li (2020) on the discrete dynamics of gradient descent to the nonsmooth and stochastic setting. Our main result applies to binary classification with exponential or logistic losses. We additionally discuss extensions to more general settings.
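For reference, the normalized margin driving this line of analysis can be written as follows (notation assumed here, in the standard form of Lyu and Li (2020), not a formula quoted from the abstract): for a network f with parameters θ that is homogeneous of degree L,

```latex
\[
  \bar{\gamma}(\theta) \;=\; \min_{1 \le i \le n} \frac{y_i \, f(x_i;\theta)}{\|\theta\|_2^{\,L}},
  \qquad \text{where } f(x; c\,\theta) = c^{L} f(x;\theta) \ \text{ for all } c > 0 .
\]
```

Homogeneity makes this quantity scale-invariant in θ, which is why statements about late-stage training are naturally phrased in terms of normalized iterates.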
Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks
Tsilivis, Nikolaos; Vardi, Gal; Kempe, Julia
We study the implicit bias of the general family of steepest descent algorithms, which includes gradient descent, sign descent and coordinate descent, in deep homogeneous neural networks. We prove that an algorithm-dependent geometric margin starts increasing once the networks reach perfect training accuracy and characterize the late-stage bias of the algorithms. In particular, we define a generalized notion of stationarity for optimization problems and show that the algorithms progressively reduce a (generalized) Bregman divergence, which quantifies proximity to such stationary points of a margin-maximization problem. We then experimentally zoom into the trajectories of neural networks optimized with various steepest descent algorithms, highlighting connections to the implicit bias of Adam.
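To make the family concrete, here is a hedged sketch of the standard dual-norm-scaled steepest-descent update (an illustration of the general rule, not code from the paper): under the l2 norm it recovers plain gradient descent, under l-infinity it becomes sign descent, and under l1 it becomes coordinate descent.

```python
import numpy as np

def steepest_descent_step(w, grad, lr, norm="l2"):
    """One dual-norm-scaled steepest-descent step w.r.t. the chosen norm.

    "l2"   -> plain gradient descent
    "linf" -> sign descent (every coordinate moves by its sign)
    "l1"   -> coordinate descent (only the largest-|grad| coordinate moves)
    """
    if norm == "l2":
        return w - lr * grad
    if norm == "linf":
        # dual norm ||grad||_1 times the maximizer sign(grad) over the linf ball
        return w - lr * np.linalg.norm(grad, 1) * np.sign(grad)
    if norm == "l1":
        # dual norm ||grad||_inf times the maximizing coordinate vector
        step = np.zeros_like(grad)
        j = np.argmax(np.abs(grad))
        step[j] = grad[j]          # = ||grad||_inf * sign(grad[j])
        return w - lr * step
    raise ValueError(f"unknown norm: {norm}")
```

For lr = 1 and gradient (3, -1, 0.5) from the origin, the three rules move to (-3, 1, -0.5), (-4.5, 4.5, -4.5), and (-3, 0, 0) respectively, making the algorithm-dependence of the trajectory (and hence of the margin being maximized) visible.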
Feature selection with gradient descent on two-layer networks in low-rotation regimes
This work establishes low test error of gradient flow (GF) and stochastic gradient descent (SGD) on two-layer ReLU networks with standard initialization, in three regimes where key sets of weights rotate little (either naturally due to GF and SGD, or due to an artificial constraint), and making use of margins as the core analytic technique. The first regime is near initialization, specifically until the weights have moved by $\mathcal{O}(\sqrt m)$, where $m$ denotes the network width, which is in sharp contrast to the $\mathcal{O}(1)$ weight motion allowed by the Neural Tangent Kernel (NTK); here it is shown that GF and SGD only need a network width and number of samples inversely proportional to the NTK margin, and moreover that GF attains at least the NTK margin itself, which suffices to establish escape from bad KKT points of the margin objective, whereas prior work could only establish nondecreasing but arbitrarily small margins. The second regime is the Neural Collapse (NC) setting, where data lies in extremely well-separated groups, and the sample complexity scales with the number of groups; here the contribution over prior work is an analysis of the entire GF trajectory from initialization. Lastly, if the inner layer weights are constrained to change in norm only and cannot rotate, then GF with large widths achieves globally maximal margins, and its sample complexity scales with their inverse; this is in contrast to prior work, which required infinite width and a tricky dual convergence assumption. As purely technical contributions, this work develops a variety of potential functions and other tools which will hopefully aid future work.
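A minimal concrete instance of this setting can be sketched as follows (a toy example with assumed data, width, and step size, not code from the paper; the outer layer is frozen at random signs, so the network is 1-homogeneous in the trained inner weights and the normalized margin is min_i y_i f(x_i) / ||W||_F):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, lr = 64, 2, 0.1                 # width, input dim, step size (illustrative)
X = np.array([[1.0, 0.2], [0.8, -0.3], [-1.0, -0.2], [-0.7, 0.4]])
y = np.array([1.0, 1.0, -1.0, -1.0])  # toy separable data

W = rng.normal(size=(m, d)) / np.sqrt(d)           # trained inner weights
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # frozen outer signs

def f(X, W):
    # two-layer ReLU network with fixed outer layer
    return np.maximum(X @ W.T, 0.0) @ a

def loss(X, y, W):
    return np.sum(np.log1p(np.exp(np.clip(-y * f(X, W), None, 60.0))))

loss0 = loss(X, y, W)
for _ in range(5000):
    S = (X @ W.T > 0).astype(float)                              # ReLU active sets
    p = 1.0 / (1.0 + np.exp(np.clip(y * f(X, W), None, 60.0)))   # per-example loss weights
    # gradient of the logistic loss w.r.t. the inner weights W
    G = -a[:, None] * ((S * (p * y)[:, None]).T @ X)
    W -= lr * G

# normalized margin of the (1-homogeneous) trained network
norm_margin = np.min(y * f(X, W)) / np.linalg.norm(W)
```

After training fits the data, the quantity `norm_margin` is the object whose growth and limiting value the margin-based analyses above are concerned with.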
Directional convergence and alignment in deep learning
Recent efforts to rigorously analyze the optimization of deep networks have yielded many exciting developments, for instance the neural tangent (Jacot et al., 2018; Du et al., 2018; Allen-Zhu et al., 2018; Zou et al., 2018) and mean-field perspectives (Mei et al., 2019; Chizat and Bach, 2018). In these works, it is shown that small training or even testing error is possible for wide networks. The above theories, with finite width networks, usually require the weights to stay close to initialization in certain norms. By contrast, practitioners run their optimization methods as long as their computational budget allows (Shallue et al., 2018), and if the data can be perfectly classified, the parameters are guaranteed to diverge in norm to infinity (Lyu and Li, 2019). This raises a worry that the prediction surface can continually change during training; indeed, even on simple data, as in Figure 1, the prediction surface continues to change after perfect classification is achieved, and even with large width is not close to the maximum margin predictor from the neural tangent regime. If the prediction surface never stops changing, then the generalization behavior, adversarial stability, and other crucial properties of the predictor could also be unstable. In this paper, we resolve this worry by guaranteeing stable convergence behavior of deep networks as training proceeds, despite this growth of weight vectors to infinity. Concretely:
1. Directional convergence: the parameters converge in direction, which suffices to guarantee convergence of many other relevant quantities, such as the prediction margins.
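The phenomenon of directional convergence can be checked numerically in a toy setting (an illustrative linear predictor, not the paper's deep-network setting; data and hyperparameters are assumed): the parameter norm grows without bound along gradient descent on a separable problem, yet the gap between normalized iterates at successive checkpoints shrinks.

```python
import numpy as np

# Toy separable data (illustrative).
X = np.array([[1.0, 0.5], [0.8, -0.2], [-1.0, -0.5], [-0.6, 0.3]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w, lr = np.array([1.0, -2.0]), 0.2    # arbitrary nonzero init
norms, dirs = [], []
for t in range(1, 300001):
    s = 1.0 / (1.0 + np.exp(np.clip(y * (X @ w), None, 60.0)))
    w += lr * ((X.T * y) @ s)         # gradient step on the logistic loss
    if t % 100000 == 0:               # checkpoints at t = 1e5, 2e5, 3e5
        norms.append(np.linalg.norm(w))
        dirs.append(w / np.linalg.norm(w))

gap01 = np.linalg.norm(dirs[1] - dirs[0])
gap12 = np.linalg.norm(dirs[2] - dirs[1])
# norms keep increasing across checkpoints, while gap12 < gap01:
# the direction w/||w|| settles down even though ||w|| diverges.
```

Since the prediction surface of a homogeneous model depends only on the direction of the parameters, the shrinking direction gaps are exactly what rules out the worry that the predictor never stops changing.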